Manual rollback after grace period #9643
This pull request does not have a backport label. Could you fix it @pchila? 🙏
This pull request is now in conflicts. Could you fix it? 🙏
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
buildkite test this
Code looks good to me, I went through the manual test steps and it worked with the following observations:
@ycombinator still has open comments to resolve, @ycombinator can you confirm these are addressed and approve? This LGTM but I want to avoid approving when there are open comments from someone else remaining.
ycombinator left a comment
@pchila Thanks for addressing my comments. I've resolved most of them, just a couple where I'd like to see a bit more in the inline comments explaining the "why" + one other question about some code. After that, this LGTM!
💛 Build succeeded, but was flaky
cc @pchila
* Allow for multiple directories to be specified during cleanup
* refactor manual rollback function and tests on a separate file
* Split manual rollback between watching and non-watching cases
* Implement manual rollback from list of agent installs
* fix lint errors
* Normalize install descriptors at startup
* Add integration test for manual rollback after grace period
* fix linter errors
* Set commit hash in TTLMarker when preparing available rollbacks
* Pass versionedHomesToKeep to installModifier.Cleanup()
* change check for running TTL marker normalization at startup
* remove references to install registry
* implement code review feedback
* fixup! implement code review feedback


What does this PR do?
This PR builds upon #8767 by allowing a manual rollback of an elastic-agent even after the grace period ends.
It uses the registry of available elastic-agent rollbacks introduced with PR #10344.
Having a registry of agent versions available for rollback makes it possible to detect manual rollback targets. The same available rollbacks are written to the .update-marker file so that the watcher preserves them at the end of the grace period, which still allows a rollback after the watcher exits.
Installs available for rollback are assigned a TTL, so they can be cleaned up after a given time (governed by the agent.upgrade.rollback.window setting).
With this PR, cleanup and normalization of agent installs happen only at startup.
In a follow-up PR:
Why is it important?
This PR makes it possible to manually roll back an elastic-agent upgrade within a configurable window that extends beyond the grace period (the period during which an automatic rollback may be triggered by the upgraded agent misbehaving).
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
Disruptive User Impact
No impact for users, as the feature is deactivated when using the default value of agent.upgrade.rollback.window: 0
How to test this PR locally
SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package
9.3.0-SNAPSHOT as usual
Wait for the new agent to come online and check the upgrade details for the UPG_WATCHING state. After the grace period of 1 minute, the upgrade details should disappear and the watcher should exit.
Verify that the data/elastic-agent-9.3.0-SNAPSHOT-<hash> directory is still present after the watcher has exited.
Manually roll back to the previous version:
elastic-agent status and verify that the agent restarted with version 9.2.0-SNAPSHOT and that the directory data/elastic-agent-9.3.0+build20251022000000-<hash> contains only logs.
Related issues
Questions to ask yourself